{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Random forests\n",
    "\n",
    "In this notebook, we will present the random forest models and show the\n",
    "differences with the bagging ensembles.\n",
    "\n",
    "Random forests are a popular model in machine learning. They are a\n",
    "modification of the bagging algorithm. In bagging, any classifier or regressor\n",
    "can be used. In random forests, the base classifier or regressor is always a\n",
    "decision tree.\n",
    "\n",
    "Random forests have another particularity: when training a tree, the search\n",
    "for the best split is done only on a subset of the original features taken at\n",
    "random. The random subsets are different for each split node. The goal is to\n",
    "inject additional randomization into the learning procedure to try to\n",
    "decorrelate the prediction errors of the individual trees.\n",
    "\n",
    "Therefore, random forests are using **randomization on both axes of the data\n",
    "matrix**:\n",
    "\n",
    "- by **bootstrapping samples** for **each tree** in the forest;\n",
    "- randomly selecting a **subset of features** at **each node** of the tree.\n",
    "\n",
    "## A look at random forests\n",
    "\n",
    "We will illustrate the usage of a random forest classifier on the adult census\n",
    "dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
    "target_name = \"class\"\n",
    "data = adult_census.drop(columns=[target_name, \"education-num\"])\n",
    "target = adult_census[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The adult census contains some categorical data and we encode the categorical\n",
    "features using an `OrdinalEncoder` since tree-based models can work very\n",
    "efficiently with such a naive representation of categorical variables.\n",
    "\n",
    "Since there are rare categories in this dataset we need to specifically encode\n",
    "unknown categories at prediction time in order to be able to use\n",
    "cross-validation. Otherwise some rare categories could only be present on the\n",
    "validation side of the cross-validation split and the `OrdinalEncoder` would\n",
    "raise an error when calling its `transform` method with the data points of the\n",
    "validation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "from sklearn.compose import make_column_transformer, make_column_selector\n",
    "\n",
    "categorical_encoder = OrdinalEncoder(\n",
    "    handle_unknown=\"use_encoded_value\", unknown_value=-1\n",
    ")\n",
    "preprocessor = make_column_transformer(\n",
    "    (categorical_encoder, make_column_selector(dtype_include=object)),\n",
    "    remainder=\"passthrough\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We will first give a simple example where we will train a single decision tree\n",
    "classifier and check its generalization performance via cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "tree = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "scores_tree = cross_val_score(tree, data, target)\n",
    "\n",
    "print(\n",
    "    \"Decision tree classifier: \"\n",
    "    f\"{scores_tree.mean():.3f} \u00b1 {scores_tree.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Similarly to what was done in the previous notebook, we construct a\n",
    "`BaggingClassifier` with a decision tree classifier as base model. In\n",
    "addition, we need to specify how many models do we want to combine. Note that\n",
    "we also need to preprocess the data and thus use a scikit-learn pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import BaggingClassifier\n",
    "\n",
    "bagged_trees = make_pipeline(\n",
    "    preprocessor,\n",
    "    BaggingClassifier(\n",
    "        estimator=DecisionTreeClassifier(random_state=0),\n",
    "        n_estimators=50,\n",
    "        n_jobs=2,\n",
    "        random_state=0,\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_bagged_trees = cross_val_score(bagged_trees, data, target)\n",
    "\n",
    "print(\n",
    "    \"Bagged decision tree classifier: \"\n",
    "    f\"{scores_bagged_trees.mean():.3f} \u00b1 {scores_bagged_trees.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the generalization performance of the bagged trees is already much\n",
    "better than the performance of a single tree.\n",
    "\n",
    "Now, we will use a random forest. You will observe that we do not need to\n",
    "specify any `estimator` because the estimator is forced to be a decision tree.\n",
    "Thus, we just specify the desired number of trees in the forest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "random_forest = make_pipeline(\n",
    "    preprocessor,\n",
    "    RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_random_forest = cross_val_score(random_forest, data, target)\n",
    "\n",
    "print(\n",
    "    \"Random forest classifier: \"\n",
    "    f\"{scores_random_forest.mean():.3f} \u00b1 \"\n",
    "    f\"{scores_random_forest.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that the random forest is performing slightly better than the bagged\n",
    "trees possibly due to the randomized selection of the features which\n",
    "decorrelates the prediction errors of individual trees and as a consequence\n",
    "make the averaging step more efficient at reducing overfitting.\n",
    "\n",
    "## Details about default hyperparameters\n",
    "\n",
    "For random forests, it is possible to control the amount of randomness for\n",
    "each split by setting the value of `max_features` hyperparameter:\n",
    "\n",
    "- `max_features=0.5` means that 50% of the features are considered at each\n",
    "  split;\n",
    "- `max_features=1.0` means that all features are considered at each split\n",
    "  which effectively disables feature subsampling.\n",
    "\n",
    "By default, `RandomForestRegressor` disables feature subsampling while\n",
    "`RandomForestClassifier` uses `max_features=np.sqrt(n_features)`. These\n",
    "default values reflect good practices given in the scientific literature.\n",
    "\n",
    "However, `max_features` is one of the hyperparameters to consider when tuning\n",
    "a random forest:\n",
    "- too much randomness in the trees can lead to underfitted base models and can\n",
    "  be detrimental for the ensemble as a whole,\n",
    "- too few randomness in the trees leads to more correlation of the prediction\n",
    "  errors and as a result reduce the benefits of the averaging step in terms of\n",
    "  overfitting control.\n",
    "\n",
    "In scikit-learn, the bagging classes also expose a `max_features` parameter.\n",
    "However, `BaggingClassifier` and `BaggingRegressor` are agnostic with respect\n",
    "to their base model and therefore random feature subsampling can only happen\n",
    "once before fitting each base model instead of several times per base model as\n",
    "is the case when adding splits to a given tree.\n",
    "\n",
    "We summarize these details in the following table:\n",
    "\n",
    "| Ensemble model class     | Base model class          | Default value for `max_features`   | Features subsampling strategy |\n",
    "|--------------------------|---------------------------|------------------------------------|-------------------------------|\n",
    "| `BaggingClassifier`      | User specified (flexible) | `n_features` (no&nbsp;subsampling) | Model level                   |\n",
    "| `RandomForestClassifier` | `DecisionTreeClassifier`  | `sqrt(n_features)`                 | Tree node level               |\n",
    "| `BaggingRegressor`       | User specified (flexible) | `n_features` (no&nbsp;subsampling) | Model level                   |\n",
    "| `RandomForestRegressor`  | `DecisionTreeRegressor`   | `n_features` (no&nbsp;subsampling) | Tree node level               |"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}